Consider binary observations whose response probability is an
unknown smooth function of a set of covariates. Suppose that a prior 
on the response probability function is induced by a Gaussian process
mapped to the unit interval through a link function. In this paper we 
study consistency of the resulting posterior distribution. If the covari- 
ance kernel has derivatives up to a desired order and the bandwidth 
parameter of the kernel is allowed to take arbitrarily small values, we 
show that the posterior distribution is consistent in the $L_1$-distance.
As an auxiliary result to our proofs, we show that, under certain 
conditions, a Gaussian process assigns positive probabilities to the 
uniform neighborhoods of a continuous function. This result may be 
of independent interest in the literature for small ball probabilities of 
Gaussian processes. 

1. Introduction. Consider a binary response variable $Y$ corresponding
to a $d$-dimensional covariate $x$. The problem is to estimate the response
probability function $p(x) = P(Y = 1 \mid x)$ over the entire covariate space
based on an increasing number of observations. We assume that the possible
values of the covariate lie in a compact subset $\mathfrak{X} \subset \mathbb{R}^d$. A Bayesian
method for estimating $p$ was developed in [4]. A prior on $p$ was induced by
the relation $p(x) = H(\eta(x))$, where $\eta$ is a Gaussian process indexed by
$\mathfrak{X}$ and $H$ is a known strictly increasing, Lipschitz continuous cumulative
distribution function on $\mathbb{R}$. Choudhuri, Ghosal and Roy [4] described
algorithms for computing the posterior distribution of $p$ and numerically
investigated the properties of the posterior.

In this paper we show consistency of the posterior distribution of p, where 
the prior is assigned through a Gaussian process as in [4]. Statistical pro- 
cedures are often justified by asymptotics, and posterior consistency plays 
a major role in validating a Bayesian method. The posterior distribution
is said to be consistent if the posterior probability of any small
neighborhood of the true parameter value converges to one. Because the notion of
consistency is dependent on the topology used to define the neighborhoods,
one needs to consider an appropriate topology such as the one based on the
$L_1$-distance. Because consistency of $p$ is directly related to the distribution
of the covariate values, it makes sense to consider the $L_1$-distance weighted by
the distribution of the covariates or their empirical measure. In the next
section we present three different consistency results: for a random covariate
with respect to the $L_1$-distance based on the distribution of covariates, for
a designed covariate with respect to the $L_1$-distance based on the empirical
distribution of the covariate, and finally for a designed one-dimensional
covariate with respect to the $L_1$-distance based on Lebesgue measure. The
results hold provided that the covariance kernel of the Gaussian process has 
a certain number of derivatives. We show posterior consistency by verifying 
prior positivity and entropy (or testing) conditions of the general posterior 
consistency results of Ghosal, Ghosh and Ramamoorthi [6] or Choudhuri, 
Ghosal and Roy [3]. An interesting alternative approach to posterior consis- 
tency was given in Walker [13]. 

In the course of our proof we derive two important auxiliary results. First, 
we show that a Gaussian process assigns positive probability to any uniform 
neighborhood of a function in the reproducing kernel Hilbert space of the 
covariance kernel. This result is of significant general interest. Second, we 
establish a probabilistic bound on the supremum of the derivative of Gaus- 
sian processes with covariance kernels that are differentiable up to a certain 
order. 

The complete flexibility in the shape of the sample paths of a Gaussian 
process makes it an interesting prior for other function estimation problems, 
such as density estimation or regression function estimation on a bounded 
interval. The Gaussian process prior was first used in the context of den- 
sity estimation by Leonard [10] and Lenk [9]. Posterior consistency of the 
resulting procedure was recently shown by Tokdar and Ghosh [11]. In the 
context of additive error nonparametric regression, Choi and Schervish [2] 
established posterior consistency under certain conditions. Following our ap- 
proach, it seems possible to treat other generalized regression models, such 
as Poisson regression, in a similar manner, although the test construction 
method will be problem specific. The natural extension of consistency re- 
sults will be the characterization of the posterior rate of convergence in the 
sense of Ghosal, Ghosh and van der Vaart [7]. Some of the results obtained 
here may be useful for that purpose as well. 

The paper is organized as follows. In the next section we state our main 
results. Positivity of uniform balls under the Gaussian measure is shown in 
Section 3. In Section 4 we obtain a useful result on the tail of a Gaussian 
process and its derivatives, which is subsequently used to show that the
complement of a certain function sieve has only exponentially small probability under the
Gaussian process prior. Tests with exponentially small error probabilities for 
testing a function against the complement of an appropriate neighborhood 
are obtained in Section 5. The results of these sections are used to prove the 
main theorems in Section 6. 

2. Main results. In this section we describe the model and the prior
and present our main results. Let $Y$ be a binary response corresponding
to a $d$-dimensional covariate $x$ and $p(x) = P(Y = 1 \mid x)$. Let the covariate
values belong to a compact subset $\mathfrak{X}$ of $\mathbb{R}^d$. Let $H$ be a known strictly
increasing, Lipschitz continuous cumulative distribution function on $\mathbb{R}$ and
let $\eta(x) = H^{-1}(p(x))$. A prior on $p(x)$ is induced by a Gaussian process prior
on $\{\eta(x): x \in \mathfrak{X}\}$ with mean function $\mu(x)$ and covariance kernel $\sigma(x, x')$
through the mapping $p(x) = H(\eta(x))$. The covariance kernel is assumed to
be of the form

(2.1)    $\sigma(x, x') = \tau^{-1}\sigma_0(\lambda x, \lambda x')$,

where $\sigma_0(\cdot,\cdot)$ is a nonsingular covariance kernel and the hyper-parameters
$\tau > 0$ and $\lambda > 0$ play the roles of a scaling parameter and (the reciprocal
of) a bandwidth parameter, respectively. Let the hyper-priors on $\tau$ and $\lambda$ be
$\tau \sim \Pi_\tau$ and $\lambda \sim \Pi_\lambda$, respectively, where $\Pi_\tau$ and $\Pi_\lambda$ are absolutely
continuous probability measures on $(0, \infty)$.
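As an informal illustration of this prior, the sketch below draws one sample path of $\eta$ on a grid from a Gaussian process whose kernel has the form $\tau^{-1}\sigma_0(\lambda x, \lambda x')$ and maps it to a response probability. The squared-exponential $\sigma_0$, the logistic link $H$, the zero mean function, the fixed values of $\tau$ and $\lambda$, and the small diagonal jitter are all assumptions made for the example; none of them is prescribed by the text.

```python
import math
import random


def sigma0(s, t):
    """Assumed base kernel: squared exponential (not prescribed by the text)."""
    return math.exp(-(s - t) ** 2)


def kernel(x, y, tau, lam):
    """Covariance kernel tau^{-1} * sigma0(lam * x, lam * y)."""
    return sigma0(lam * x, lam * y) / tau


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_response_probability(grid, tau, lam, rng, jitter=1e-6):
    """Draw eta ~ GP(0, kernel) on the grid and return p = H(eta), H logistic."""
    n = len(grid)
    cov = [[kernel(grid[i], grid[j], tau, lam) + (jitter if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    eta = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    return [1.0 / (1.0 + math.exp(-e)) for e in eta]


grid = [i / 10 for i in range(11)]
p = sample_response_probability(grid, tau=1.0, lam=5.0, rng=random.Random(0))
```

Larger values of `lam` give rougher-looking paths on a fixed grid, and larger values of `tau` shrink $\eta$ toward its mean, which is the sense in which the two hyper-parameters act as a bandwidth and a scale.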

Theorem 4 shows that the sample paths of the Gaussian processes can
approximate a large class of functions very well and thus, for the purpose of
posterior consistency, it is not necessary to consider additional uncertainty
in the link function $H$. In fact, the parameter $\tau$ could be taken to be a
fixed constant without affecting posterior consistency. However, practical
considerations of small sample accuracy suggest putting a suitable prior on
$\tau$. Likewise, it is also sensible to consider the possibility of the presence of
hyper-parameters in the "trend function" $\mu$; see Remark 2. On the other
hand, it is necessary to vary the bandwidth parameter $\lambda$ all over $(0, \infty)$ to
obtain posterior consistency.

We shall work with the sieve of response probability functions

(2.2)    $\Theta_n = \Theta_{n,\alpha} = \{p(\cdot): p(x) = H(\eta(x)),\ \|D^w\eta\|_\infty < M_n,\ |w| \le \alpha\}$;

here and below $D^w\eta$ stands for $(\partial^{|w|}/\partial t_1^{w_1}\cdots\partial t_d^{w_d})\,\eta(t_1, \ldots, t_d)$,
$|w| = \sum_j w_j$, $\alpha$ is some positive integer and $M_n$ is a sequence of real
numbers. Let sequences $\lambda_n$ and $\tau_n$ be such that $\Pi_\tau(\tau < \tau_n) = e^{-cn}$ and
$\Pi_\lambda(\lambda > \lambda_n) = e^{-cn}$, for some constant $c$. Specific forms of the
hyper-priors and the sequences will be discussed later.



Let

(2.3)    $\mathcal{A} = \{\eta(x) = \sum_{i=1}^{k} a_i\sigma_0(\lambda x, \lambda t_i):\ a_1, \ldots, a_k \in \mathbb{R},\ t_1, \ldots, t_k \in \mathfrak{X},\ k \ge 1,\ \lambda > 0\}$.

Then $\bar{\mathcal{A}}$, the closure of $\mathcal{A}$ in the supremum metric, is called the
reproducing kernel Hilbert space (RKHS) of $\sigma_0$ (or, equivalently, of $\sigma$).
We make the following assumptions.

Assumption (P). For every fixed $x \in \mathfrak{X}$, the covariance function $\sigma_0(x, \cdot)$
has continuous partial derivatives up to order $2\alpha + 2$, where $\alpha$ is a positive
integer to be specified later.

The mean function $\mu(x)$ belongs to the RKHS, $\bar{\mathcal{A}}$, of the covariance kernel
$\sigma_0(\cdot,\cdot)$.

The prior $\Pi_\lambda$ for $\lambda$ is fully supported on $(0, \infty)$.



Assumption (C). The covariate space $\mathfrak{X}$ is a bounded subset of $\mathbb{R}^d$.

Assumption (T). The transformed true response function $\eta_0$ belongs
to $\bar{\mathcal{A}}$.



Assumption (T) implies that $\eta_0$ is uniformly bounded above and below,
and hence, $p_0(x) = H(\eta_0(x))$ is bounded away from 0 and 1. In our setup,
the free quantities are $\alpha$ and the sequences $M_n$, $\lambda_n$ and $\tau_n$. We do not
require $\Pi_\tau$ and $\Pi_\lambda$ to have specific forms as long as they satisfy some tail
conditions specified by the magnitude of the tail cut-off points $\lambda_n$ and $\tau_n$.
The quantity $\alpha$ specifies the smoothness of the covariance kernel. The numbers
$\alpha$, $M_n$, $\lambda_n$ and $\tau_n$ need to satisfy some interrelation described by the
following growth condition.



Assumption (G). For every $b_1 > 0$ and $b_2 > 0$, there exist sequences
$M_n$, $\tau_n$ and $\lambda_n$ such that

    $M_n^2\tau_n\lambda_n^{-2\alpha} \ge b_1 n$ and $M_n^{d/\alpha} \le b_2 n$.

The first part of Assumption (G) will be used to prove exponential decay
of the prior probability of the complement of the sieve and the second
part will be used to bound the uniform entropy number of $\Theta_n$.

Now we state our main results under different specifications for the co- 
variate values. 
2.1. Random covariate. Let $P_0$ denote the true distribution of the whole
data. We first state the posterior consistency result for the case where the
covariates arise as a random sample from a distribution $Q$ on $\mathfrak{X}$.

Theorem 1. Suppose the random covariate $X$ is sampled from a probability
distribution $Q$ on $\mathfrak{X}$. Suppose that Assumptions (P), (C), (T) and (G)
hold. Then for any $\epsilon > 0$,

    $\Pi\{p: \int |p(x) - p_0(x)|\,dQ(x) > \epsilon \mid X_1, \ldots, X_n, Y_1, \ldots, Y_n\} \to 0$

in $P_0$-probability.



The covariate measure, Q, may be viewed as a fixed quantity or a nuisance 
parameter. When Q is a fixed quantity, the posterior distribution does not 
depend on Q. Therefore, to evaluate the posterior, we need not actually 
know Q. On the other hand, when Q is treated as an unknown parameter, 
we need to specify a prior on Q. Under the natural assumption that p is 
unrelated to Q, the likelihood for p can be separated out from that of Q. 
Thus, with independent priors, p and Q will be independent a posteriori, 
and hence, the posterior distribution of p may be obtained without even 
specifying a prior on Q. Note that the posterior for p will be computed the 
same as in the case of fixed covariates. 

If the covariate measure Q permits a Lebesgue density, then we have the 
following trivial corollary. 

Corollary 1. If the covariate distribution $Q$ has Lebesgue density $q$
which is bounded below by some positive constant, then under the conditions
of Theorem 1, consistency in the usual $L_1$-distance $\int |p(x) - p_0(x)|\,dx$ holds.

2.2. Designed covariate. In the case of fixed design, often the entire set
of covariate values changes with the sample size. This is the case when the
covariates are chosen on some equally spaced grids. Thus, the covariate values
form a triangular array of the form $\{x_{i,n}: i = 1, \ldots, n\}$, where repetitions
are allowed. Let $Q_n$ be the empirical measure of the design points defined as
$Q_n = n^{-1}\sum_{i=1}^{n}\delta_{x_{i,n}}$, where $\delta_x$ denotes the unit mass probability at $x$.
Then we have consistency with respect to the $L_1$-distance based on the
empirical measure. Such a distance automatically adjusts to the concentration of
the covariates and appears to be more intrinsic than the $L_1$-distance with
respect to a fixed measure not related to the distribution of the covariate
values.

Theorem 2. Assume that the covariate values arise from a fixed design.
Then under Assumptions (P), (C), (T) and (G), for any $\epsilon > 0$,

    $\Pi\{p: \int |p(x) - p_0(x)|\,dQ_n(x) > \epsilon \mid Y_1, \ldots, Y_n\} \to 0$

in $P_0$-probability.

2.3. One-dimensional covariate. In the case of a one-dimensional nonrandom
covariate, one can also obtain consistency in the usual $L_1$-sense
under an additional assumption on the covariate values. Without loss of
generality, assume that the covariate values $x_{i,n}$, $i = 1, \ldots, n$, are in ascending
order. Let $S_{i,n} = x_{i+1,n} - x_{i,n}$ be the spacings between consecutive covariate
values.

Assumption (U). Given $\delta > 0$, there exist a constant $K_1$ and an integer
$N$ such that, for $n > N$, we have that $\sum_{i: S_{i,n} > K_1 n^{-1}} S_{i,n} \le \delta$.

The assumption merely states that the measure of the part of the design 
space where data is sparse is small. Obviously, Assumption (U) is satisfied 
by any regularly spaced design. If the design was chosen by sampling from a 
nonsingular distribution, then by the properties of spacings, it can be shown 
that Assumption (U) holds with probability tending to one. 
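To make Assumption (U) concrete, the following sketch computes the total length of the "sparse" spacings, that is, the spacings exceeding $K_1/n$, for two designs on $[0, 1]$. The function names, the choice $K_1 = 2$ and the two example designs are assumptions made for illustration only.

```python
def sparse_spacing_mass(design, k1):
    """Sum of the spacings S_i = x_{i+1} - x_i that exceed k1 / n."""
    pts = sorted(design)
    n = len(pts)
    spacings = [b - a for a, b in zip(pts, pts[1:])]
    return sum(s for s in spacings if s > k1 / n)


def equally_spaced(n):
    """Equally spaced design on [0, 1] with n points."""
    return [i / (n - 1) for i in range(n)]


def gapped(n):
    """Half of the points in [0, 0.25], half in [0.75, 1]; n must be even."""
    half = n // 2
    left = [0.25 * i / (half - 1) for i in range(half)]
    right = [0.75 + 0.25 * i / (half - 1) for i in range(half)]
    return left + right


# Regular grid: every spacing is 1/(n-1) <= 2/n, so nothing counts as sparse.
regular_mass = [sparse_spacing_mass(equally_spaced(n), k1=2.0) for n in (10, 100, 1000)]

# Gapped design: the hole of length 0.5 is always sparse, so the sum stays at 0.5.
gapped_mass = [sparse_spacing_mass(gapped(n), k1=2.0) for n in (10, 100, 1000)]
```

The regular grid satisfies the assumption for every $\delta > 0$, whereas the gapped design fails it for any $\delta < 0.5$ no matter how large $n$ is.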

Theorem 3. Suppose that the values of the covariate arise as design
points on $\mathfrak{X}$ satisfying Assumption (U) and $\mathfrak{X}$ is a bounded interval of $\mathbb{R}$.
Assume that the prior satisfies Assumption (P). The mapping $x \mapsto \eta_0(x)$ and
the prior mean $\mu(\cdot)$ are assumed to have two continuous derivatives on $\mathfrak{X}$ and
the covariance kernel $\sigma(\cdot,\cdot)$ is assumed to have continuous partial derivatives
up to order 6. Assume that $\Pi_\tau$ and $\Pi_\lambda$ are such that $\tau_n^{-1}\lambda_n^{4} = O(n)$. Then
for any $\epsilon > 0$,

    $\Pi\{p: \int |p(x) - p_0(x)|\,dx > \epsilon \mid Y_1, \ldots, Y_n\} \to 0$

in $P_0$-probability.



2.4. Remarks. 



Remark 1. If $\Pi_\tau(\tau < t) = O(e^{-b/t^{r}})$ as $t \to 0$ and $\Pi_\lambda(\lambda > L) = O(e^{-bL^{s}})$
as $L \to \infty$, for some $b, r, s > 0$, it then follows that $\tau_n = n^{-1/r}$ and
$\lambda_n = n^{1/s}$ satisfy the exponential tail requirement. Then by both assertions
of Assumption (G), we have that

    $b_1 n^{1 + r^{-1} + 2\alpha s^{-1}} \le M_n^2 \le (b_2 n)^{2\alpha/d}$,

which implies that $s > d$ and $\alpha > (1 + r^{-1})/(2(d^{-1} - s^{-1}))$. In the most
favorable case when $r \to \infty$ and $s \to \infty$, we need $\alpha > d/2$. In other words,
with the most favorably tailed priors on $\tau$ and $\lambda$, we need to assume the
existence of at least $d + 3$ derivatives of the covariance function, with the
requirement going up if the tails are thicker. The natural conjugate prior
on $\tau$ is a gamma prior, which assigns too much probability to the lower tail
and, hence, does not seem to be good in this respect. A better choice would
be the inverse gamma prior which corresponds to $r = 1$ and imposes the
restriction $\alpha > sd/(s - d)$. As there is no natural conjugate prior on $\lambda$, it
makes sense to use a prior with a very thin tail, such as $\Pi_\lambda(\lambda > L) < \ldots$.
For such thin tails the restriction on $\alpha$ reduces to $\alpha > d$. Then we will need
to assume the existence of at least $2d + 3$ derivatives of the covariance kernel
for our results to hold.
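The arithmetic behind Remark 1 can be sketched as follows: given the dimension $d$ and the tail indices $r$ and $s$, compute the bound that $\alpha$ must exceed, the smallest admissible integer $\alpha$, and the resulting order $2\alpha + 2$ of kernel derivatives. The function names are hypothetical, and `math.inf` is used here as a stand-in for the limiting cases $r \to \infty$ and $s \to \infty$.

```python
import math
from fractions import Fraction


def _inverse(x):
    """Exact reciprocal, treating math.inf as an index with reciprocal 0."""
    return Fraction(0) if x == math.inf else 1 / Fraction(x)


def alpha_threshold(d, r, s):
    """The bound (1 + 1/r) / (2 * (1/d - 1/s)) that alpha must strictly exceed."""
    if s <= d:
        raise ValueError("the tail index s must exceed the dimension d")
    return (1 + _inverse(r)) / (2 * (Fraction(1, d) - _inverse(s)))


def smallest_alpha(d, r, s):
    """Smallest integer alpha strictly above the threshold."""
    return math.floor(alpha_threshold(d, r, s)) + 1


def derivatives_needed(d, r, s):
    """Order 2 * alpha + 2 of kernel derivatives for the smallest admissible alpha."""
    return 2 * smallest_alpha(d, r, s) + 2


best_case = alpha_threshold(1, math.inf, math.inf)   # d / 2 with d = 1
inverse_gamma = alpha_threshold(3, 1, 6)             # s * d / (s - d) with s = 6, d = 3
```

Exact rational arithmetic is used so that the strict inequality on $\alpha$ is not blurred by rounding when the threshold happens to be an integer.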

Remark 2. For practical considerations, it is useful to allow
hyper-parameters in the mean function $\mu(\cdot)$. For instance, the mean could be
taken as a linear combination of a fixed number of functions, so that
$\mu(x) = \sum_{j}\beta_j\psi_j(x)$. A low-degree polynomial is often a good choice. In
order to establish posterior consistency under this scenario, one needs to
ensure that the tail of the distribution of $\beta$ is thin enough. For instance, if
$P(\|\beta\| > B) = O(e^{-c_1 B^{d/\alpha}})$ for some $c_1 > 0$, then $\{\|\beta\| > c_2 n^{\alpha/d}\}$ has
exponentially small prior probability for any $c_2 > 0$. Now for any $\beta$ with
$\|\beta\| \le c_2 n^{\alpha/d}$, the complement of the sieve defined by (2.2) continues to
have exponentially small prior probability in view of (4.2) below, provided
that $c_2$ is chosen small enough depending on $b_2$ in Assumption (G).

Remark 3. A popular method of prior construction on functions is
by expanding the function in a series $\sum_j\beta_j\psi_j(x)$ and then putting
independent $N(0, \tau_j)$ priors on the coefficients. Such a prior leads to a
Gaussian process prior on the function, where the covariance kernel is
$\sigma(x, y) = \sum_j\tau_j\psi_j(x)\psi_j(y)$. Under appropriate differentiability conditions,
our results imply posterior consistency at any $p_0 = H(\eta_0)$, where $\eta_0$ belongs
to the RKHS of $\sigma$.

3. Probability of uniform balls. In this section we establish a property
of the support of a Gaussian process which is also of general interest. Let
$\{W(t), t \in T\}$ be a Gaussian process indexed by a compact set $T \subset \mathbb{R}^d$, which
we take as $[0, 1]^d$ without loss of generality. Let the mean function of $W(t)$ be
$\mu(t)$ and the covariance kernel be $\sigma(s, t) = \tau^{-1}\sigma_0(\lambda s, \lambda t)$, where $\sigma_0$ is a fixed
covariance kernel, and $\tau > 0$ and $\lambda > 0$ are parameters that can possibly
vary according to some distribution. We are interested in finding conditions
under which $P(\|W - w\|_\infty < \epsilon) > 0$ for some nonrandom function $w(t)$. Such
a result was recently obtained by Tokdar and Ghosh [11] using an approach
based on conditioning the process at some grid points and then establishing
bounds on the conditional mean and variance. Here we provide a shorter
proof based on the Karhunen-Loève expansion of the process.

It suffices to show that the result holds for $\lambda$ and $\tau$ varying over a set of
positive probability. For our purpose, $\tau$ can be fixed, as the basis function in
the Karhunen-Loève expansion is independent of $\tau$. It follows from Lemma 2
of [11] that it suffices to fix $\lambda$ at some suitable value. Thus, with fixed $\tau$ and
$\lambda$, $W$ is a Gaussian process. We intend to show, under appropriate conditions
on the covariance kernel, that $w$ belongs to the support of the mixture of
Gaussian process priors. First, we establish that a Gaussian process, under
mild conditions, assigns positive probabilities to the uniform balls around
functions in the RKHS of the covariance kernel.

Theorem 4. Assume that $\{W(t), t \in T\}$ is a Gaussian process with
continuous sample paths having mean function $\mu(t)$ and continuous covariance
kernel $\sigma(s, t)$. Assume that $\mu(t)$ and a function $w(t)$ belong to the RKHS of
the kernel $\sigma(s, t)$. Then

(3.1)    $P(\sup_{t \in T}|W(t) - w(t)| < \epsilon) > 0$ for all $\epsilon > 0$.

Proof. We may assume without loss of generality that $\mu(t)$ is the zero
function; else we can subtract $\mu(t)$ from $W(t)$ as well as from $w(t)$.

Let $\sum_{i=1}^{\infty}\sqrt{\lambda_i}\,\xi_i\psi_i(t)$ be the Karhunen-Loève expansion of $W(t)$, so that
the $\lambda_i$'s are the eigenvalues of the kernel operator $\sigma(s, t)$, the $\psi_i(t)$ are the
corresponding eigenfunctions and the $\xi_i$ are independent $N(0, 1)$; see [1],
Section III.3. Let $w(t)$ also be represented as $\sum_{i=1}^{\infty}\sqrt{\lambda_i}\,a_i\psi_i(t)$, where
$\sum_{i=1}^{\infty}a_i^2 < \infty$. It follows from Mercer's theorem (cf. Theorem 3.15 of [1])
that the series $\sum_{i=1}^{\infty}\sqrt{\lambda_i}\,a_i\psi_i(t)$ converges uniformly, and hence, the tail
sum is uniformly small.

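Bound $|W(t) - w(t)|$ as

(3.2)    $\sup_{t \in T}|W(t) - w(t)| \le \sup_{t \in T}|W_N(t) - w_N(t)| + \sup_{t \in T}|\bar w_N(t)| + \sup_{t \in T}|\bar W_N(t)|$,

where $W_N(t) = \sum_{i=1}^{N}\sqrt{\lambda_i}\,\xi_i\psi_i(t)$, $\bar W_N(t) = \sum_{i=N+1}^{\infty}\sqrt{\lambda_i}\,\xi_i\psi_i(t)$,
$w_N(t) = \sum_{i=1}^{N}\sqrt{\lambda_i}\,a_i\psi_i(t)$ and $\bar w_N(t) = \sum_{i=N+1}^{\infty}\sqrt{\lambda_i}\,a_i\psi_i(t)$. Let
$\epsilon > 0$ be given. The second term on the right-hand side of (3.2) is nonrandom
and less than $\epsilon/3$ for $N$ large enough by the uniform convergence.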

The basis expansion of the Gaussian process $W_N(t) - w_N(t)$, for any given
$N$, has finitely many terms involving i.i.d. $N(0, 1)$ variables and continuous
function coefficients. Then, the nonsingularity of a normal distribution with
nonsingular covariance implies that $P(\sup_{t \in T}|W_N(t) - w_N(t)| < \epsilon/3) > 0$ for
any fixed $N$.

Now if $P(\sup_{t \in T}|\bar W_N(t)| < \epsilon/3) > 0$ for some $N$, then exploiting the
independence of $W_N$ and $\bar W_N$, it can be easily shown that (3.1) holds. Thus,
it suffices to show that $P(\sup_{t \in T}|\bar W_N(t)| < \epsilon) \to 1$ as $N \to \infty$ for any fixed
$\epsilon > 0$. However, as $W(t)$ has continuous sample paths, by assumption, it
follows that

    $P(\sup_{t \in T}|\bar W_N(t)| > \epsilon) = P(\sup_{t \in T}|W_N(t) - W(t)| > \epsilon) \le \epsilon^{-2}E(\sup_{t \in T}|W_N(t) - W(t)|^2)$,

which converges to 0 as $N \to \infty$ by Theorem 3.8 of [1]. This completes the
proof. □
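The statement of Theorem 4 can be explored numerically in a case where the Karhunen-Loève expansion is known in closed form. The sketch below is an illustration only: it assumes standard Brownian motion on $[0, 1]$ (eigenvalues $\lambda_i = ((i - 1/2)\pi)^{-2}$, eigenfunctions $\psi_i(t) = \sqrt{2}\sin((i - 1/2)\pi t)$), truncates the expansion at a finite number of terms, and evaluates the supremum on a finite grid, so it estimates a probability for $W_N$ rather than for $W$ itself. All names and parameter values are hypothetical.

```python
import math
import random


def kl_basis(n_terms, grid):
    """Rows sqrt(lambda_i) * psi_i(t) for standard Brownian motion on [0, 1]."""
    basis = []
    for i in range(1, n_terms + 1):
        freq = (i - 0.5) * math.pi
        # lambda_i = 1 / freq^2 and psi_i(t) = sqrt(2) * sin(freq * t)
        basis.append([math.sqrt(2.0) * math.sin(freq * t) / freq for t in grid])
    return basis


def uniform_ball_probability(a, eps, n_terms=20, n_grid=51, n_sim=500, seed=0):
    """Monte Carlo estimate of P(max_t |W_N(t) - w(t)| < eps) over a grid,
    where w = sum_i sqrt(lambda_i) * a_i * psi_i uses the coefficients in `a`."""
    rng = random.Random(seed)
    grid = [k / (n_grid - 1) for k in range(n_grid)]
    basis = kl_basis(n_terms, grid)
    coef = list(a) + [0.0] * (n_terms - len(a))
    hits = 0
    for _ in range(n_sim):
        # xi_i - a_i are the coefficients of W_N - w in the same basis
        diff = [rng.gauss(0.0, 1.0) - coef[i] for i in range(n_terms)]
        sup = max(abs(sum(diff[i] * basis[i][k] for i in range(n_terms)))
                  for k in range(n_grid))
        hits += sup < eps
    return hits / n_sim


estimate = uniform_ball_probability(a=[0.5, -0.3], eps=1.0)
```

A strictly positive estimate is consistent with (3.1); shrinking `eps` or enlarging the coefficients in `a` lowers the estimate, which then requires more simulations to distinguish from zero.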

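Now assume that $\sigma_0$ is bounded away from zero on $T \times T$. Let

(3.3)    $\mathcal{L} = \{w(t) = \sum_{i=1}^{k} a_i\sigma_0(\lambda t, \lambda t_i):\ a_i \in \mathbb{R},\ t_i \in T,\ 1 \le i \le k,\ k \ge 1,\ \lambda > 0\}$.

Let $w_0 \in \bar{\mathcal{L}}$, where $\bar{\mathcal{L}}$ is the closure of $\mathcal{L}$. Then by the discussion preceding
Theorem 4, it follows that $P(\sup_{t \in T}|W(t) - w_0(t)| < \epsilon) > 0$, where $W$ has
the mixture of Gaussian processes distribution discussed there.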

For many covariance kernels, $\mathcal{L}$ is dense in $C(T)$, in which case every
continuous function will be in the support of the prior. For example, if $d = 1$ and
$\sigma_0(s, t) = \varphi(s - t)$ for some nonzero, continuous density function $\varphi$ on $\mathbb{R}$, then
Tokdar and Ghosh [11] showed that $\mathcal{L}$ is dense in $C(T)$. For higher dimensions,
if the covariance kernel is the Kronecker product of one-dimensional
kernels in the sense that $\sigma((s_1, \ldots, s_d), (t_1, \ldots, t_d)) = \sigma_1(s_1, t_1)\cdots\sigma_d(s_d, t_d)$,
where each $\sigma_j$ has RKHS $C(T_j)$, $j = 1, \ldots, d$, then Tokdar and Ghosh [11]
showed also that the RKHS of $\sigma$ is $C(T_1 \times \cdots \times T_d)$. For instance, it follows
from there that the kernel $\sigma((s_1, \ldots, s_d), (t_1, \ldots, t_d)) = \exp[-\sum_{j=1}^{d}\lambda_j(s_j - t_j)^2]$
on $\mathfrak{X}$, where the $\lambda_j$'s are unrestricted, has RKHS $C(\mathfrak{X})$ for any product
type compact domain $\mathfrak{X}$.

4. Sieves and tail probabilities. 

Lemma 1. Let $\Theta_n$ be as defined in (2.2) and Assumptions (P), (C) and
(G) hold. Then $\Pi(\Theta_n^c) \le Ae^{-cn}$ for some constants $A$ and $c$.

Because $\Pi_\lambda(\lambda > \lambda_n) \le e^{-cn}$ and $\Pi_\tau(\tau < \tau_n) \le e^{-cn}$, it suffices to uniformly
bound the probability of $\Theta_n^c$ for given $\lambda \le \lambda_n$ and $\tau \ge \tau_n$. The lemma will
follow from the following result about Gaussian processes which could also
be of general interest.



Theorem 5. Let $\eta(\cdot)$ be a Gaussian process on $\mathfrak{X}$, a bounded subset of
$\mathbb{R}^d$. Assume that the mean function $\mu(\cdot)$ is in $C^{\alpha}(\mathfrak{X})$ and the covariance
kernel $\sigma(\cdot,\cdot)$ has $2\alpha + 2$ mixed partial derivatives for some $\alpha \ge 1$. Then $\eta(\cdot)$
has differentiable sample paths with mixed partial derivatives up to order
$\alpha$ and the successive derivative processes $D^w\eta(\cdot)$ are also Gaussian with
continuous sample paths. Also, the derivative processes are sub-Gaussian
with respect to a constant multiple of the Euclidean distance. Further, there
exists a constant $d_w$ such that

(4.1)    $P(\sup_{x \in \mathfrak{X}}|D^w\eta(x)| > M) \le K(\eta)e^{-d_w M^2/\sigma_w^2(\eta)}$

for $w = (w_1, w_2, \ldots, w_d)$, $w_i \in \{0, 1, 2, \ldots, \alpha\}$, $|w| \le \alpha$ and
$\sigma_w^2(\eta) = \sup_{x \in \mathfrak{X}}\operatorname{var}(D^w\eta(x)) < \infty$, $K(\eta)$ is a polynomial in the supremum of the
$(2\alpha + 2)$-order derivatives of $\sigma$ and the covariance functions of the derivative
processes $D^w\eta(x)$ are functions of the derivatives of the covariance kernel $\sigma$.

Proof. We may assume, without loss of generality, that the mean function
is identically zero, because for $M$ sufficiently large,

(4.2)    $P(\eta: \|\eta\|_\infty > M) \le P(\eta: \|\eta - \mu\|_\infty > M - \|\mu\|_\infty) \le P(\eta: \|\eta - \mu\|_\infty > M/2)$.

First we show that the process constructed by taking the partial derivative
of $\eta$ with respect to the $j$th component, $D_j\eta(\cdot)$, is again a Gaussian
process with continuous sample paths and covariance kernel $D_j^2\sigma(\cdot,\cdot)$. Here
and below, $D_j^2$ is the partial derivative operator with respect to the $j$th
components of both arguments of $\sigma$, that is, $D_j^2\sigma((s_1, \ldots, s_d), (t_1, \ldots, t_d)) =
(\partial^2/\partial s_j\,\partial t_j)\sigma((s_1, \ldots, s_d), (t_1, \ldots, t_d))$. According to our general notation,
$D_j\eta(\cdot) = D^w\eta(\cdot)$ and $D_j^2\sigma(\cdot,\cdot) = D^wD^w\sigma(\cdot,\cdot)$ componentwise, where $w = e_j$,
the $d$-dimensional vector with one at the $j$th place and zeros elsewhere.

To show $D_j\eta$ is again a Gaussian process, we need to investigate the
path properties of the one-parameter Gaussian process obtained by letting
the $j$th component vary and holding all other $d - 1$ parameters fixed. For
notational simplicity, we suppress the dependence of the process on the other
$d - 1$ parameters. By Section 9.4 of [5], (a version of) $\eta(\cdot)$ has continuously
differentiable sample paths if

    $|\Delta_h\Delta_h D_j^2\sigma(s, t)| \le C/|\log|h||^{a}$ as $h \to 0$

for some $C > 0$ and $a > 3$, where

    $\Delta_h\Delta_h D_j^2\sigma(s, t) = D_j^2\sigma(s + he_j, t + he_j) - D_j^2\sigma(s + he_j, t)
        - D_j^2\sigma(s, t + he_j) + D_j^2\sigma(s, t)$.

Because $\sigma(\cdot,\cdot)$ has bounded mixed partial derivatives of at least up to fourth
order, the above condition is trivially satisfied and the process $\eta(\cdot)$, as a
process with respect to the $j$th coordinate, has continuously differentiable
sample paths. The limit of a sequence of multivariate normal variables is
again a multivariate normal, and $D_j\eta(t) = \lim_{h \to 0}(\eta(t + he_j) - \eta(t))/h$.
It follows that $D_j\eta(\cdot)$ is a Gaussian process. Moreover,

    $E(D_j\eta(s) - D_j\eta(t))^2 = \lim_{h \to 0}E\{[\eta(s + he_j) - \eta(s) - \eta(t + he_j) + \eta(t)]/h\}^2$.

This follows by the uniform integrability of $(\eta(t + he_j) - \eta(t))^2/h^2$, which is
a consequence of the fact

    $E|(\eta(t + he_j) - \eta(t))/h|^4 = 3\{[\sigma(t + he_j, t + he_j) + \sigma(t, t) - 2\sigma(t + he_j, t)]/h^2\}^2 \le B_0$

for some constant $B_0$. Then, the intrinsic semimetric for the partial
derivative process is given by

    $E(D_j\eta(s) - D_j\eta(t))^2 = \lim_{h \to 0}E|\eta(s + he_j) - \eta(s) - \eta(t + he_j) + \eta(t)|^2/h^2$
        $= \lim_{h \to 0}h^{-2}\{\sigma(s + he_j, s + he_j) + \sigma(t + he_j, t + he_j)$
            $+ \sigma(s, s) + \sigma(t, t) + 2\sigma(s + he_j, t) + 2\sigma(s, t + he_j)$
            $- 2(\sigma(s + he_j, t + he_j) + \sigma(t, t + he_j) + \sigma(s, t) + \sigma(s, s + he_j))\}$.

Using the symmetry of the covariance function and by Taylor's expansion,
we have, after simplification, that

(4.3)    $E(D_j\eta(s) - D_j\eta(t))^2 = D_j^2\sigma(s, s) + D_j^2\sigma(t, t) - 2D_j^2\sigma(s, t) = \sigma^*(s, t)$, say.



As the covariance function has bounded mixed partial derivatives up to order
$2\alpha + 2$, by Taylor's expansion, we have for $C = \sup_{s,t}|D_j^2\sigma^*(s, t)|$ that

(4.4)    $E(D_j\eta(s) - D_j\eta(t))^2 \le C\|s - t\|^2$.

Thus, the partial derivative process is sub-Gaussian with respect to a
constant multiple of the Euclidean distance. Note that from (4.4), the covariance
kernel for $D_j\eta(\cdot)$ is given by

(4.5)    $\operatorname{cov}(D_j\eta(s), D_j\eta(t)) = D_j^2\sigma(s, t)$.
Further, as the kernel $D_j^2\sigma(s, t)$ has at least two mixed derivatives with
respect to each component, it follows from [5] that the multi-indexed process
$D_j\eta$ has continuous sample paths. Thus, the sample paths of $\eta$ are, with
probability one, continuously differentiable with respect to each argument.

Replacing $\eta$ by $D_j\eta$, the mixed partial derivative process $D_{j'}D_j\eta$ is again
a Gaussian process and is sub-Gaussian with respect to a constant multiple
of the Euclidean distance. In general, the derivative process $D^w\eta(s)$ for
$|w| \le \alpha$ is a Gaussian process with covariance kernel $D^wD^w\sigma(\cdot,\cdot)$.

Now $\sigma_w^2(\eta) = \sup\{\operatorname{var}(D^w\eta(s)): s \in \mathfrak{X}\} < \infty$. We thus have
$N(\epsilon, \mathfrak{X}, \|\cdot\|) \le C\epsilon^{-d}$ and $N(\epsilon, \mathfrak{X}, \|\cdot\|_\rho) \le C'\epsilon^{-d}$ for some constants $C$ and
$C'$ depending on the measure of the set $\mathfrak{X}$ and the kernel $\sigma$. Here $N$ stands for
the covering number, $\|\cdot\|$ is the Euclidean distance and $\|\cdot\|_\rho$ is the intrinsic
semi-metric of the derivative process $D^w\eta(s)$. The result then follows by applying
Proposition A.2.7 of [12], page 442, to each of the derivative processes. □

To complete the proof of Lemma 1, consider the kernel of the form
$\sigma(s, t) = \tau^{-1}\sigma_0(\lambda s, \lambda t)$. Let $\xi$ be a process with a fixed covariance kernel $\sigma_0$.
Then the mixed derivative processes $D^w\xi(\cdot)$ up to order $\alpha$ have uniformly
bounded variances. Now for $\lambda \le \lambda_n$, $\tau \ge \tau_n$ and $|w| = \alpha$,

    $\sigma_w^2(\eta) = \tau^{-1}\lambda^{2\alpha}\sigma_w^2(\xi) \le \tau_n^{-1}\lambda_n^{2\alpha}\sigma_w^2(\xi)$.

The rest of the proof now follows easily from Theorem 5 because the
contribution from $K(\eta)$ grows only polynomially in $n$.
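The scaling of the derivative variance with $\tau$ and $\lambda$ can be checked numerically in the simplest case $d = 1$, $\alpha = 1$. The sketch below assumes a squared-exponential $\sigma_0$ (an illustrative choice, not one made in the text) and approximates the variance of the first derivative process, $\partial^2\sigma/\partial s\,\partial t$ at $s = t$, by a central finite difference; the helper names and the step size are hypothetical.

```python
import math


def sigma0(s, t):
    """Assumed base kernel: squared exponential."""
    return math.exp(-(s - t) ** 2)


def derivative_variance(kernel, x, h=1e-3):
    """Central finite-difference approximation of d^2 kernel / (ds dt) at s = t = x,
    which is the variance of the first derivative process at x."""
    return (kernel(x + h, x + h) - kernel(x + h, x - h)
            - kernel(x - h, x + h) + kernel(x - h, x - h)) / (4.0 * h * h)


def scaled_kernel(tau, lam):
    """Kernel tau^{-1} * sigma0(lam * s, lam * t)."""
    return lambda s, t: sigma0(lam * s, lam * t) / tau


tau, lam = 2.0, 3.0
var_xi = derivative_variance(sigma0, 0.3)                    # process with kernel sigma0
var_eta = derivative_variance(scaled_kernel(tau, lam), 0.3)  # process with scaled kernel
ratio = var_eta / var_xi                                     # close to lam**2 / tau
```

Up to finite-difference error, `ratio` agrees with $\tau^{-1}\lambda^{2\alpha}$ for $\alpha = 1$, which is the factor that the growth condition on $M_n$, $\tau_n$ and $\lambda_n$ has to absorb.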

5. Entropy bounds and test construction. 

Lemma 2. The $\epsilon$-covering number $N(\epsilon, \Theta_n, \|\cdot\|_\infty)$ in the supremum
norm, of $\Theta_n$ defined by (2.2), satisfies $\log N(\epsilon, \Theta_n, \|\cdot\|_\infty) \le KM_n^{d/\alpha}\epsilon^{-d/\alpha}$
for some constant $K$.

Proof. The result follows immediately from Theorem 2.7.1 of [12], 
page 155. □ 

Lemma 3. Let $\nu$ be a finite measure on $\mathfrak{X}$ and let $\psi_1$ and $\psi_2$ be measurable
functions such that $0 \le \psi_1, \psi_2 \le M$ and $\int |\psi_1 - \psi_2|\,d\nu \ge (1 + \nu(\mathfrak{X}))\epsilon$ for
some $M, \epsilon > 0$. Then $\nu\{x: |\psi_1(x) - \psi_2(x)| > \epsilon\} \ge \epsilon/M$.

Proof. By the given condition,

(5.1)    $(\nu(\mathfrak{X}) + 1)\epsilon \le \int_{x: |\psi_1(x) - \psi_2(x)| > \epsilon}|\psi_1(x) - \psi_2(x)|\,d\nu(x)
            + \int_{x: |\psi_1(x) - \psi_2(x)| \le \epsilon}|\psi_1(x) - \psi_2(x)|\,d\nu(x)$
         $\le M\nu\{x: |\psi_1(x) - \psi_2(x)| > \epsilon\} + \epsilon\nu(\mathfrak{X})$.

The result now follows by rearranging the terms. □
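Lemma 3 is easy to check on a discrete measure. The sketch below takes $\nu$ to be the empirical measure of four design points (so $\nu(\mathfrak{X}) = 1$) with made-up function values; the helper name and the numbers are assumptions chosen only to illustrate the inequality.

```python
def lemma3_check(weights, psi1, psi2, eps, upper):
    """Return (hypothesis_holds, nu{|psi1 - psi2| > eps}, eps / upper)
    for a discrete measure nu with the given point masses."""
    total = sum(weights)                                   # nu(X)
    l1 = sum(w * abs(a - b) for w, a, b in zip(weights, psi1, psi2))
    big = sum(w for w, a, b in zip(weights, psi1, psi2) if abs(a - b) > eps)
    return l1 >= (1.0 + total) * eps, big, eps / upper


weights = [0.25, 0.25, 0.25, 0.25]   # empirical measure of four design points
psi1 = [0.9, 0.8, 0.5, 0.1]
psi2 = [0.1, 0.1, 0.5, 0.1]
holds, big, bound = lemma3_check(weights, psi1, psi2, eps=0.15, upper=1.0)
```

Here the $L_1$-distance is 0.375, which exceeds $(1 + \nu(\mathfrak{X}))\epsilon = 0.3$, and the set where the two functions differ by more than $\epsilon$ carries mass 0.5, comfortably above the guaranteed $\epsilon/M = 0.15$.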

Applying Lemma 3 to $\psi_1 = p$, $\psi_2 = p_0$ and $\nu = Q_n$, we obtain

(5.2)    $I_{n,p} = \#\{x_i: |p(x_i) - p_0(x_i)| > \epsilon\} \ge K'n$

for some $K'$. Let $A_p^+ = \{x: p(x) > p_0(x) + \epsilon\}$ and $A_p^- = \{x: p(x) < p_0(x) - \epsilon\}$.
Then either $A_p^+$ or $A_p^-$ contains at least $K'n/2$ points. For definiteness,
assume that $m = m_n = \#I_p^+ \ge K'n/2$, where $I_p^+ = \{i: x_i \in A_p^+\}$. For a given
$p$, to test the simple null $p_0$ against the simple alternative $p$, we construct
a test based on the observations corresponding to only those design points
which are in $A_p^+$. Then (5.2) asserts that there is no loss of order of the
number of indices. The next lemma is stated in a general framework and
shows how to construct such a test.

Lemma 4. Let $Y_j$ be independent Bernoulli variables with $P(Y_j = 1) = \mu_j$,
$j = 1, \ldots, m$. Consider testing $H_0: \mu_j = \mu_{0j}$ against $H_1: \mu_j = \mu_{1j}$, where
$\mu_{1j} > \mu_{0j} + \epsilon$ for all $j$ and $0 < \epsilon_0 < \mu_{0j} < 1 - \epsilon_0 < 1$; here $\epsilon_0 > 0$ and $\epsilon > 0$
do not depend on $m$ and $\epsilon < \epsilon_0$. Consider the test
$\Psi_m = 1\{\sum_{j=1}^{m}(Y_j - \mu_{0j}) > m\epsilon/2\}$. Then for all sufficiently large $m$,

(5.3)    $E_{P_0}(\Psi_m) \le e^{-m\epsilon^2/2}$, $E_{P_1}(1 - \Psi_m) \le e^{-m\epsilon^2/2}$,

where $P_0$ and $P_1$ are respectively the probability measures under the null and
the alternative.

Remark 4. The above lemma also holds if $\mu_{1j} < \mu_{0j} - \epsilon$ for all
$j = 1, \ldots, m$ if the test is defined as one that rejects $H_0$ for
$\sum_{j=1}^{m}(Y_j - \mu_{0j}) < -m\epsilon/2$.

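Proof of Lemma 4. Write $\bar Y_m = m^{-1}\sum_{j=1}^{m}Y_j$. By Hoeffding's inequality
([8], Theorem 1),

    $E_{P_0}(\Psi_m) = P_0(\bar Y_m - E_{P_0}\bar Y_m > \epsilon/2) \le e^{-m\epsilon^2/2}$.

For all sufficiently large $m$, another application of Hoeffding's inequality
gives

    $E_{P_1}(1 - \Psi_m) = P_1(\bar Y_m - E_{P_0}\bar Y_m \le \epsilon/2)$
        $= P_1((\bar Y_m - E_{P_1}\bar Y_m) + m^{-1}\sum_{j=1}^{m}(\mu_{1j} - \mu_{0j}) \le \epsilon/2)$
        $\le P_1((\bar Y_m - E_{P_1}\bar Y_m) \le -\epsilon/2) \le e^{-m\epsilon^2/2}$.

This completes the proof. □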
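For identically distributed observations the two error probabilities of this test can be computed exactly from the binomial distribution and compared with the Hoeffding bound $e^{-m\epsilon^2/2}$. The sketch below does this for one hypothetical configuration ($m = 100$, $\mu_0 = 0.5$, $\mu_1 = 0.75$, $\epsilon = 0.2$); the equal-means simplification and all names are assumptions made for illustration.

```python
import math


def binomial_pmf(m, k, p):
    """P(Binomial(m, p) = k)."""
    return math.comb(m, k) * p ** k * (1.0 - p) ** (m - k)


def error_probabilities(m, mu0, mu1, eps):
    """Exact type I and type II errors of the test that rejects when
    sum_j (Y_j - mu0) > m * eps / 2, for i.i.d. Bernoulli observations."""
    rejects = [k - m * mu0 > m * eps / 2.0 for k in range(m + 1)]
    type1 = sum(binomial_pmf(m, k, mu0) for k in range(m + 1) if rejects[k])
    type2 = sum(binomial_pmf(m, k, mu1) for k in range(m + 1) if not rejects[k])
    return type1, type2


m, mu0, mu1, eps = 100, 0.5, 0.75, 0.2
type1, type2 = error_probabilities(m, mu0, mu1, eps)
hoeffding = math.exp(-m * eps ** 2 / 2.0)   # the bound appearing in (5.3)
```

Both exact error probabilities fall below the bound, which is typically far from tight; what matters for the consistency argument is only that the bound decays exponentially in $m$.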
For a given $p$, we consider the test $\Psi_{n,p}$ which rejects the simple null $p_0$
against the simple alternative $p$ if

(5.4)    $\sum_{i \in I_p^+}(Y_i - p_0(x_i)) > m\epsilon/2$.

By Lemma 4 above, the test satisfies (5.3) for a simple alternative $p$.

To remove the dependence on $p$, we use the standard technique of covering
a set by small balls and estimating the covering numbers. Note that, for a
fixed $\epsilon > 0$, if $i \in I_p^+$ and $\|p^* - p\|_\infty < \epsilon/2$, then

(5.5)    $p^*(x_i) - p_0(x_i) > p(x_i) - p_0(x_i) - \|p - p^*\|_\infty > \epsilon/2$.

Therefore, applying Lemma 4 (with $\epsilon$ replaced by $\epsilon/2$), we obtain a test
$\Psi_{n,p}$ such that $E_{p_0}\Psi_{n,p} \le e^{-m\epsilon^2/8}$ and $E_{p^*}(1 - \Psi_{n,p}) \le e^{-m\epsilon^2/8}$.

With $N = N(\epsilon/2, \Theta_n, \|\cdot\|_\infty)$, get $p_1, \ldots, p_N \in \Theta_n$ with the property that,
for any $p$, there exists a $p_j \in \Theta_n$ such that $\|p - p_j\|_\infty < \epsilon/2$. Consider the
test $\Phi_n = \max(\Psi_{n,p_j}, j = 1, \ldots, N)$. Then

(5.6)    $E_{p_0}\Phi_n \le \sum_{j=1}^{N}E_{p_0}\Psi_{n,p_j} \le Ne^{-m\epsilon^2/8} = \exp(\log N - m\epsilon^2/8)$.

If $p \in \Theta_n$, choose $j$ such that $\|p - p_j\|_\infty < \epsilon/2$. Then

(5.7)    $E_p(1 - \Phi_n) \le E_p(1 - \Psi_{n,p_j}) \le e^{-m\epsilon^2/8}$.

Then by Lemma 2, for any given constant $b > 0$ we can choose a sufficiently
small $b_2$ and $M_n$ satisfying Assumption (G) with $b_2$, such that $\log N \le bn$.
Then choosing $m = m_n$ of order $n$, the test $\Phi_n$ satisfies the requirement

(5.8)    $E_{p_0}\Phi_n \le e^{-c'n}$, $E_p(1 - \Phi_n) \le e^{-c'n}$

for some constant $c'$.

6. Proof of the main theorems. Now we prove Theorems 1-3. 

Proof of Theorem 1. We are considering the model

(6.1)    $Y_i \mid X_i \sim \mathrm{Binomial}(1, p(X_i))$, $X_i \overset{\text{i.i.d.}}{\sim} Q$, $i = 1, 2, \ldots, n$.

Then the joint density of $X$ and $Y$ with respect to the product of $Q$ and the
counting measure $\nu$ on $\{0, 1\}$, say, is given by $f(x, y) = p(x)^y(1 - p(x))^{1-y}$.
The corresponding true joint density is $f_0(x, y) = p_0(x)^y(1 - p_0(x))^{1-y}$.
Recall that by Assumption (T), $\epsilon_0 < p_0(x) < 1 - \epsilon_0$ for some $\epsilon_0 < 1/2$. This
implies that $f_0(x, y) > \epsilon_0$. Also observe that $\int |f_1 - f_2|\,d\nu\,dQ = 2\int |p_1 - p_2|\,dQ$
and $\int f_0\log(f_0/f)\,d\nu\,dQ = \int p_0\log(p_0/p)\,dQ + \int (1 - p_0)\log((1 - p_0)/(1 - p))\,dQ$.
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We verify the conditions given in Theorem 2 in [6]. It may be noted that 
although their result is stated for Lebesgue densities on M, it is valid for 
densities in any measure space. 

We first show that n{/ : / /olog(/o//) < e} > for all e>0, where 11 is 
the prior for /, or equivalently, 

njp: Jpolog^dQ + Jil-po)log^-^^dQ <e^ >0 for all e > 0. 

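We shall use the following lemma which follows easily from Taylor's expansion.

Lemma 5. Let $0 < \varepsilon_0 < \frac{1}{2}$ and $\varepsilon_0 < \alpha, \beta < 1 - \varepsilon_0$. Then there exists a
constant $L$ depending only on $\varepsilon_0$ such that

$\alpha \left(\log\frac{\alpha}{\beta}\right)^m + (1 - \alpha)\left(\log\frac{1 - \alpha}{1 - \beta}\right)^m \le L(\alpha - \beta)^2, \quad m = 1, 2.$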
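
As a numerical illustration of Lemma 5, the sketch below (Python) scans a grid of pairs (a, b) strictly inside (eps0, 1 - eps0) and records the largest ratio of a*log(a/b)^m + (1 - a)*log((1 - a)/(1 - b))^m to (a - b)^2. The grid size and the function names are assumptions made for the illustration. The value found is only a grid-based lower estimate of the smallest admissible L, not a proof of the lemma.

```python
import math

def lemma5_lhs(a, b, m):
    """a * log(a/b)^m + (1 - a) * log((1 - a)/(1 - b))^m."""
    return (a * math.log(a / b) ** m
            + (1.0 - a) * math.log((1.0 - a) / (1.0 - b)) ** m)

def empirical_constant(eps0, m, grid=200):
    """Largest ratio lhs / (a - b)^2 over a grid inside (eps0, 1 - eps0)."""
    pts = [eps0 + (1.0 - 2.0 * eps0) * (i + 0.5) / grid for i in range(grid)]
    worst = 0.0
    for a in pts:
        for b in pts:
            if a != b:  # the ratio is 0/0 on the diagonal, so skip it
                worst = max(worst, lemma5_lhs(a, b, m) / (a - b) ** 2)
    return worst

if __name__ == "__main__":
    for m in (1, 2):
        print(m, empirical_constant(0.1, m))
```

The ratio stays bounded on the grid because both arguments are kept away from 0 and 1, which is the role played by eps0 in the lemma.
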

Let $B = \{p : \|p - p_0\|_\infty < \tfrac{1}{2}c\}$, where $c = \inf\{\min(p_0(x), 1 - p_0(x)) : 0 \le x \le 1\} > 0$
and $\|p - p_0\|_\infty = \sup\{|p(x) - p_0(x)| : 0 \le x \le 1\}$. If $p \in B$, then it
follows from Lemma 5 that

(6.2) $\max\left(\int p_0 \log\frac{p_0}{p} \, dQ,\ \int (1 - p_0) \log\frac{1 - p_0}{1 - p} \, dQ\right) \le L \|p - p_0\|_\infty^2.$

Hence, it suffices to show that $\Pi\{p : \|p - p_0\|_\infty < \varepsilon\} > 0$ for every $\varepsilon > 0$.
Because $p_0(\cdot) = H(\eta_0(\cdot))$ and the function $u \mapsto H(u)$ is bounded and Lipschitz
continuous, it is enough to show that

(6.3) $\Pi(\eta : \|\eta - \eta_0\|_\infty < \varepsilon) > 0$ for every $\varepsilon > 0$.

The result now follows from Theorem 4. 

To verify the entropy condition of Theorem 2 in [6], let $\beta > 0$ be given.
We consider the sieve $\mathcal{F}_n = \{f(x, y) = p(x)^y (1 - p(x))^{1-y} : p \in \Theta_n\}$, where
$\Theta_n$ is defined in (2.2) with $M_n = b n^{\alpha/d}$ and $b > 0$ is a constant to be chosen
sufficiently small. By Lemma 2, for some constant $K$, we have that
$\log N(\varepsilon, \Theta_n, \|\cdot\|_\infty) \le K \varepsilon^{-d/\alpha} b^{d/\alpha} n$. Now choosing $b < (\beta/K)^{\alpha/d} \varepsilon$, we can
ensure that $\log N(\varepsilon, \Theta_n, \|\cdot\|_\infty) < n\beta$.
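
The choice of b in the entropy step is plain arithmetic, and the following Python sketch checks it for a few parameter values. The numerical values of K, beta, alpha, d and n are arbitrary assumptions for the illustration; in the paper K comes from Lemma 2 and is not specified numerically.

```python
def entropy_bound(eps, b, n, K, alpha, d):
    """K * eps^(-d/alpha) * b^(d/alpha) * n, the bound obtained with M_n = b * n^(alpha/d)."""
    return K * eps ** (-d / alpha) * b ** (d / alpha) * n

def largest_b(eps, beta, K, alpha, d):
    """Threshold (beta / K)^(alpha/d) * eps below which the bound is under n * beta."""
    return (beta / K) ** (alpha / d) * eps

if __name__ == "__main__":
    eps, beta, K, alpha, d, n = 0.1, 0.05, 3.0, 2, 1, 1000
    b = 0.9 * largest_b(eps, beta, K, alpha, d)  # any b strictly below the threshold
    print(entropy_bound(eps, b, n, K, alpha, d), n * beta)
```

Substituting the threshold itself gives exactly n * beta, so any strictly smaller b yields the strict inequality used in the proof.
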

Finally, from Lemma 1, we have that $\Pi(\Theta_n^c)$ is exponentially small. This
completes the proof. □

Proof of Theorem 2. For nonrandom covariates, the observations are
independent but not identically distributed. We shall apply Theorem 2 of [3].
The prior positivity condition follows essentially by the same arguments
used in the random covariate case. Consider the sieve $\Theta_n$ defined by (2.2).
The condition (A3)(iii) of Theorem 2 of [3] holds by Lemma 1. To verify
their conditions (A3)(i) and (A3)(ii), we need to show that there exist
exponentially consistent tests for testing $H_0 : p = p_0$ against an alternative
$H_a : \|p - p_0\|_{1,Q_n} > \varepsilon$ for all $\varepsilon > 0$. The test constructed in (5.8) satisfies the
required conditions. □

Proof of Theorem 3. In this case we verify the conditions of Theorem 2
in [3] for the sieve defined by (2.2) with $\alpha = 2$. In view of the proof
of our Theorem 2, the only condition that needs to be additionally verified
is the testing condition for the usual $L_1$-distance. To construct the required
sequence of tests, we estimate the number of covariate values where the true
probability function $p_0$ and the alternative $p$ differ by at least a specified
amount. The following lemma, where we assume without loss of generality
that $\mathfrak{X} = [0, 1]$, estimates that number. As the number is at least a fraction
of $n$, it follows that the required tests can be constructed as in the proof of
Theorem 2. □

Lemma 6. For any $p \in \Theta_{n,2}$ such that $\int |p(x) - p_0(x)| \, dx > 5\varepsilon$, (5.2)
holds.

Proof. For a given function $h$, let $b(x, k, h) = \sum_{j=0}^{k} h(j/k) \binom{k}{j} x^j (1 - x)^{k-j}$
stand for the corresponding Bernstein polynomial of order $k$.
Then it is well known (and easy to see) that $\sup\{|h(x) - b(x, k, h)| : 0 \le x \le 1\}$
$\le A k^{-1} \sup\{|h''(x)| : 0 \le x \le 1\}$, where $A$ is an absolute constant.
Under given assumptions, the choice M„ ~ ^•n^l'^ satisfies Assumption (G) 
for a sufficiently small $\gamma$. Let $\gamma'$ be a given small constant and let $\gamma$ be
sufficiently small such that $\gamma < \gamma' \varepsilon/(2A)$. Choose a sequence $k_n \sim \gamma' n$. Let
$b(x) = b(x, k_n, p)$ and $b_0(x) = b(x, k_n, p_0)$. Therefore, it follows from the above
discussion that, for any $p \in \Theta_{n,2}$, we have that $\|p - b\|_\infty \le A(\gamma' n)^{-1} \gamma n < \varepsilon/2$,
and $\|p_0 - b_0\|_\infty \le A(\gamma' n)^{-1} \gamma n < \varepsilon/2$. Hence,

$|p(x) - p_0(x)| \ge |b(x, k_n, p) - b(x, k_n, p_0)| - \|p - b\|_\infty - \|p_0 - b_0\|_\infty > \varepsilon$

if $x \in B_p = \{t : |b(t, k_n, p) - b(t, k_n, p_0)| > 2\varepsilon\}$. Therefore, $B_p \subset A_p := \{x :
|p(x) - p_0(x)| > \varepsilon\}$, and hence, it suffices to show the assertion with
$I_{n,p}$ replaced by $I^*_{n,p} = \#\{i : x_i \in B_p\}$.
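
The Bernstein approximation step used in this proof can be illustrated numerically. The Python sketch below evaluates the Bernstein polynomial of order k, that is, the sum over j = 0, ..., k of h(j/k) C(k, j) x^j (1 - x)^(k - j), and compares its sup-distance from h on a grid with the classical bound sup|h''| / (8k). The value A = 1/8, the test function and the grid are assumptions made for this illustration; the text only states that A is an absolute constant.

```python
import math

def bernstein(h, k, x):
    """Bernstein polynomial of order k of the function h, evaluated at x in [0, 1]."""
    return sum(h(j / k) * math.comb(k, j) * x ** j * (1.0 - x) ** (k - j)
               for j in range(k + 1))

def sup_error(h, k, grid=200):
    """Maximum of |h(x) - b(x, k, h)| over an equally spaced grid on [0, 1]."""
    return max(abs(h(i / grid) - bernstein(h, k, i / grid))
               for i in range(grid + 1))

if __name__ == "__main__":
    # Hypothetical smooth response probability with values in (0, 1);
    # its second derivative is -3.6 * sin(3x), so sup|h''| <= 3.6.
    h = lambda x: 0.5 + 0.4 * math.sin(3.0 * x)
    for k in (5, 20, 80):
        print(k, sup_error(h, k), 3.6 / (8 * k))
```

The error decays at rate 1/k, which is why an order k_n proportional to n compensates a second-derivative bound that grows linearly in n.
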

Clearly, $0 \le b(x), b_0(x) \le 1$. Also, $\|b - b_0\|_1 \ge \|p - p_0\|_1 - \|p - b\|_\infty - \|p_0 - b_0\|_\infty > 4\varepsilon$.
Applying Lemma 3 to the pair $b$ and $b_0$ and $\nu$, the Lebesgue
measure on $[0, 1]$, and $\varepsilon$ replaced by $2\varepsilon$, we obtain that $|B_p| \ge 2\varepsilon$.

Now, as $b$ and $b_0$ are polynomials of order at most $k_n$, the set $B_p$ is at
most a union of $k_n$ intervals. Let $J_1, J_2, \ldots$ be these intervals. Find $K_1$ such
that Assumption (U) holds for $\delta = \varepsilon$. Call a spacing interval $(x_{i,n}, x_{i+1,n})$
of type I if $S_{i,n} \le K_1/n$. For any value of $j$, let $J_j^*$ be the union of type I
spacing intervals $(x_{i,n}, x_{i+1,n})$ that are completely contained in $J_j$. Note that
at most two type I spacing intervals may be partially contained in $J_j$ for
any $j$, which has total length bounded by $2K_1/n$. Put $B^* = \bigcup_j J_j^*$. Thus by
Assumption (U), $|B_p \cap (B^*)^c| \le \varepsilon + 2K_1 k_n/n$ and hence $|B^*| \ge \varepsilon - 2K_1 k_n/n$.
For $j = 1, 2, \ldots$, let $R_j$ be the number of type I spacing intervals completely
contained in $J_j$. Considering the possibility that $J_j^*$ may not contain its end
points, we find that $J_j^*$ contains at least $R_j - 2$ design points, and hence $B^*$
contains at least $\sum_j R_j - 2k_n$ design points. To estimate $\sum_j R_j$, note that
$\varepsilon - 2K_1 k_n/n \le |B^*| \le \sum_j R_j K_1/n$, and hence $\sum_j R_j \ge n\varepsilon/K_1 - 2k_n$. Hence
$B_p$ contains at least $(n\varepsilon/K_1) - 4k_n$ points, which is greater than $n\varepsilon/(2K_1)$
if we choose $\gamma' < \varepsilon/(8K_1)$. □

Acknowledgment. Suggestions of the referees have improved the exposition
of the paper.